Abstract

In this paper we study the use of convolutional neural 
networks (convnets) for the task of pedestrian detection. 
Despite their recent diverse successes, convnets historically 
underperform compared to other pedestrian detectors. We 
deliberately omit explicitly modelling the problem into the 
network (e.g. parts or occlusion modelling) and show that 
we can reach competitive performance without bells and 
whistles. In a wide range of experiments we analyse small 
and big convnets, their architectural choices, parameters, 
and the influence of different training data, including pre-training
on surrogate tasks.

We present the best convnet detectors on the Caltech and
KITTI datasets. On Caltech our convnets reach top performance
both for the Caltech1x and Caltech10x training setups.
Using additional data at training time, our strongest convnet
model is competitive even with detectors that use additional
data (optical flow) at test time.

1. Introduction 

In recent years the field of computer vision has seen an 
explosion of success stories involving convolutional neural 
networks (convnets). Such architectures currently provide 
top results for general object classification [25, 36], general 
object detection [40], feature matching [16], stereo matching
[45], scene recognition [48, 8], pose estimation [41, 7],
action recognition [23, 38] and many other tasks [35, 3].
Pedestrian detection is a canonical case of object detection 
with relevant applications in car safety, surveillance, and 
robotics. A diverse set of ideas has been explored for this 
problem [13, 18, 12, 5] and established benchmark datasets 
are available [12, 17]. We would like to know if the success 
of convnets is transferable to the pedestrian detection task. 

Previous work on neural networks for pedestrian detection
has relied on special-purpose designs, e.g. hand-crafted
features, part and occlusion modelling. Although
these proposed methods perform ably, current top methods
are all based on decision trees learned via Adaboost
[5, 47, 34, 28, 44]. In this work we revisit the question,
and show that both small and large vanilla convnets can
reach top performance on the challenging Caltech pedestrians
dataset. We provide extensive experiments regarding
the details of training, network parameters, and different
proposal methods.

Figure 1: Comparison of convnet methods on the Caltech
test set (see section 7). Our CifarNet and AlexNet results
significantly improve over previous convnets, and match
the best reported results (SpatialPooling+, which additionally
uses optical flow).

1.1. Related work 

Despite the popularity of the task of pedestrian detection,
only a few works have applied deep neural networks to this
task: we are aware of only six.

The first paper using convnets for pedestrian detection
[37] focuses on how to handle the limited training data (they
use the INRIA dataset, which provides 614 positive and
1218 negative images for training). First, each layer is initialized
using a form of convolutional sparse coding, and the
entire network is subsequently fine-tuned for the detection
task. They propose an architecture that uses features from
the last and second-to-last layers for detection. This method is
named ConvNet [37].

A different line of work extends a deformable parts
model (DPM) [15] with a stack of Restricted Boltzmann Machines
(RBMs) trained to reason about parts and occlusion
(DBN-Isol) [30]. This model was extended to account
for person-to-person relations (DBN-Mut) [32] and finally
to jointly optimize all these aspects: JointDeep [31]
jointly optimizes features, parts deformations, occlusions,
and person-to-person relations.

The MultiSDP [46] network feeds each layer with contextual
features computed at different scales around the candidate
pedestrian detection. Finally SDN [27], the current
best performing convnet for pedestrian detection, uses additional
“switchable layers” (RBM variants) to automatically
learn both low-level features and high-level parts (e.g.
“head”, “legs”, etc.).

Note that none of the existing papers rely on a “straightforward”
convolutional network similar to the original LeNet
[26] (layers of convolutions, non-linearities, pooling,
inner products, and a softmax on top). We will revisit this
decision in this paper.

Object detection Other than pedestrian detection, related
convnets have been used for detection of ImageNet
[36, 25, 20, 40, 29, 39] and Pascal VOC categories [19, 2].
The most successful general object detectors are based on
variants of the R-CNN framework [19]. Given an input image,
a reduced set of detection proposals is created, and
these are then evaluated via a convnet. This is essentially
a two-stage cascade sliding window method. See [21] for a
review of recent proposal methods.

Detection proposals The most popular proposal method
for generic objects is SelectiveSearch [42]. The recent
review [21] also points out EdgeBoxes [49] as a fast
and effective method. For pedestrian detection DBN-Isol
and DBN-Mut use DPM [15] for proposals. JointDeep,
MultiSDP, and SDN use a HOG+CSS+linear SVM detector
(similar to [43]) for proposals. Only ConvNet [37] applies
a convnet in a sliding fashion.

Decision forests Most methods proposed for pedestrian
detection do not use convnets for detection. Leaving
aside methods that use optical flow, the current
top performing methods (on the Caltech and KITTI datasets)
are SquaresChnFtrs [5], InformedHaar [47],
SpatialPooling [34], LDCF [28], and Regionlets
[44]. All of them are boosted decision forests, and can be
considered variants of the integral channels features architecture
[11]. Regionlets and SpatialPooling use
a large set of features, including HOG, LBP and CSS,
while SquaresChnFtrs, InformedHaar, and LDCF
build over HOG+LUV. On the Caltech benchmark, the best
convnet (SDN) is outperformed by all aforementioned methods.¹

Input to convnets It is important to highlight that
ConvNet [37] learns to predict from YUV input pixels,
whereas all other methods use additional hand-crafted features.
DBN-Isol and DBN-Mut use HOG features as
input. MultiSDP uses HOG+CSS features as input.
JointDeep and SDN use YUV+Gradients as input (and
HOG+CSS for the detection proposals). We will show in
our experiments that good performance can be reached using
RGB alone, but we also show that more sophisticated
inputs systematically improve detection quality. Our data
indicates that the antagonism “hand-crafted features versus
convnets” is an illusion.

1.2. Contributions 

In this paper we propose to revisit pedestrian detection
with convolutional neural networks by carefully exploring
the design space (number of layers, filter sizes, etc.), and
the critical implementation choices (training data preprocessing,
effect of detection proposals, etc.). We show that
both small (10^5 parameters) and large (6 · 10^7 parameters)
networks can reach good performance when trained from
scratch (even when using the same data as previous methods).
We also show the benefits of using extended and external
data, which leads to the strongest single-frame detector
on Caltech. We report the best known performance
for a convnet on the challenging Caltech dataset (improving
by more than 10 percent points), and the first convnet
results on the KITTI dataset.

2. Training data 

It is well known that for convnets the volume of training 
data is quite important to reach good performance. Below
are the datasets we consider throughout the paper.

Caltech The Caltech dataset and its associated benchmark
[12, 5] is one of the most popular pedestrian detection
datasets. It consists of videos captured from a car traversing
U.S. streets under good weather conditions. The standard
training set in the “Reasonable” setting consists of 4 250
frames with ~ 2 · 10^3 annotated pedestrians, and the test set
covers 4 024 frames with ~ 1 · 10^3 pedestrians.

Caltech validation set In our experiments we also use 
Caltech training data for validation. For those experiments 
we use one of the suggested validation splits [12]: the first 
five training videos are used for validation training and the 
sixth training video for validation testing. 

¹Regionlets matches SpatialPooling on the KITTI benchmark,
and thus by transitivity would improve over SDN on Caltech.



Caltech10x Because the Caltech dataset videos are fully
annotated, the amount of training data can be increased by
resampling the videos. Inspired by [28], we increase the
training data tenfold by sampling one out of three frames
(instead of one out of thirty frames in the standard setup).
This yields ~ 2 · 10^4 annotated pedestrians for training, extracted
from 42 782 frames.
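
The tenfold increase follows directly from the sampling step. A minimal sketch, assuming frames are indexed from zero and that sampling starts at the first frame (the video length used here is hypothetical, not a Caltech statistic):

```python
def sample_frames(num_frames, step):
    """Indices of the frames kept when taking one out of every `step` frames."""
    return list(range(0, num_frames, step))

num_frames = 3000  # hypothetical video length, chosen only for illustration

standard = sample_frames(num_frames, 30)  # standard setup: one out of thirty frames
dense = sample_frames(num_frames, 3)      # Caltech10x-style: one out of three frames

# The dense sampling keeps ten times as many frames and contains the standard ones.
print(len(standard), len(dense), len(dense) / len(standard))
print(set(standard).issubset(dense))
```

Neighbouring frames sampled this way are strongly correlated, which is why the gain from the extra data is not guaranteed in advance.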

KITTI The KITTI dataset [17] consists of videos captured
from a car traversing German streets, also under good
weather conditions. Although similar in appearance to Caltech,
it has been shown to have different statistics (see [5,
supplementary material]). Its training set contains 4 445
pedestrians (4 024 taller than 40 pixels) over 7 481 frames,
and its test set 7 518 frames.

ImageNet, Places In section 5 we will consider using
large convnets that can exploit pre-training for surrogate
tasks. We consider two such tasks (and their associated
datasets), the ImageNet 2012 classification of a thousand
object categories [25, 36, 19] and the classification of 205
scene categories [48]. The datasets provide 1.2 · 10^6 and
2.5 · 10^6 annotated images for training, respectively.

3. From decision forests to neural networks 

Before diving into the experiments, it is worth noting that 
the proposal method we are using can be converted into a 
convnet so that the overall system can be seen as a cascade 
of two neural networks. 

SquaresChnFtrs [4, 5] is a decision forest, where
each tree node pools and thresholds information from one
out of several feature channels. As mentioned in section 1.1
it is common practice to learn pedestrian detection convnets
on handcrafted features, thus the feature channels need not
be part of the conversion. In this case, a decision node can
be realised using (i) a fully connected layer with constant
non-zero weights corresponding to the original pooling region
and zero weights elsewhere, (ii) a bias term that applies
the threshold, and (iii) a sigmoid non-linearity that yields a
decision. A two-layer network is sufficient to model a level-2
decision tree given the three simulated node outputs. Finally,
the weighted sum over the tree decisions can be modelled
with yet another fully-connected layer.
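
The three ingredients of a single decision node can be sketched as follows. This is an illustrative toy, not the conversion code used for SquaresChnFtrs: the flattened input layout, the function names and the `gain` parameter (which controls how closely the sigmoid approximates a hard threshold) are assumptions made for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_decision_node(region, threshold, n_inputs, gain=50.0):
    """Build one unit that mimics a decision node over a flattened feature channel.

    region: indices of the inputs inside the pooling region (hypothetical layout).
    gain:   sigmoid sharpness (assumption); larger values approach a hard threshold.
    """
    # (i) constant non-zero weights on the pooling region, zero elsewhere
    weights = [1.0 if i in region else 0.0 for i in range(n_inputs)]
    # (ii) the bias applies the threshold
    bias = -threshold

    def node(x):
        z = sum(w * v for w, v in zip(weights, x)) + bias
        # (iii) the sigmoid turns the thresholded sum into a soft decision
        return sigmoid(gain * z)

    return node

node = make_decision_node(region={1, 2}, threshold=1.0, n_inputs=4)
print(node([0.0, 0.8, 0.9, 5.0]))  # pooled sum 1.7 exceeds the threshold: output near 1
print(node([9.0, 0.1, 0.2, 9.0]))  # pooled sum 0.3 is below the threshold: output near 0
```

Inputs outside the pooling region (the first and last entries above) have no influence on the decision, mirroring the zero weights described in the text.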

The mapping from SquaresChnFtrs to a deep neural
network is exact; evaluating the same inputs it will return
the exact same outputs. What is special about the resulting
network is that it has not been trained by back-propagation,
but by Adaboost [6]. This network already performs better
than the best known convnet on Caltech, SDN. Unfortunately,
experiments to soften the non-linearities and use
back-propagation to fine-tune the model parameters did not
show significant improvements.

Figure 2: Illustration of the CifarNet, ~10^5 parameters.


4. Vanilla convolutional networks 

In our experience many convnet architectures and training
hyper-parameters do not enable effective learning for diverse
and challenging tasks. It is thus considered best practice
to start exploration from architectures and parameters
that are known to work well and progressively adapt them to
the task at hand. This is the strategy of the following sections.

In this section we first consider CifarNet, a small network
designed to solve the CIFAR-10 classification problem
(10 object categories, (5 + 1) · 10^4 colour images of
32x32 pixels) [24]. In section 5 we consider AlexNet, a
network that has 600 times more parameters than CifarNet
and is designed to solve the ILSVRC2012 classification problem
(1000 object categories, (1.2 + 0.15) · 10^6 colour images
of ~VGA resolution). Both of these networks were introduced
in [25] and are re-implemented in the open source
Caffe project [22]².

Although pedestrian detection is quite a different task
than CIFAR-10, we decide to start our exploration from the
CifarNet, which provides fair performance on CIFAR-10.
Its architecture is depicted in figure 2. Unless otherwise specified
we use raw RGB input.

We first discuss how to use the CifarNet network (section
4.1). This naive approach already improves over the
best known convnets (section 4.2). Sections 4.3 and 4.4 explore
the design space around CifarNet and further push the
detection quality. All models in this section are trained using
Caltech data only (see section 2).

4.1. How to use CifarNet? 

Given an initial network specification, there are still several
design choices that affect the final detection quality. We
discuss some of them in the following paragraphs.

Detection proposals Unless otherwise specified we use
the SquaresChnFtrs [4, 5] detector to generate proposals
because, at the time of writing, it is the best performing
pedestrian detector (on Caltech) with source code available.
In figure 3 we compare SquaresChnFtrs against

²http://caffe.berkeleyvision.org


























Positives         Negatives    MR
GT                Random       83.1%
GT                IoU < 0.5    37.1%
GT                IoU < 0.3    37.2%
GT, IoU > 0.5     IoU < 0.5    42.1%
GT, IoU > 0.5     IoU < 0.3    41.3%
GT, IoU > 0.75    IoU < 0.5    39.9%

Table 1: MR for different choices of positive and negative training samples.

Window size    MR
32 x 32        50.6%
64 x 32        48.2%
128 x 64       39.9%
128 x 128      49.4%
227 x 227      54.9%

Table 2: MR for different model window sizes.

Ratio     MR
None      41.4%
1 : 10    40.6%
1 : 5     39.9%
1 : 1     39.8%

Table 3: Detection quality

Figure 3: Recall of ground truth annotations versus
Intersection-over-Union threshold on the Caltech test set.
The legend indicates the average number of detection proposals
per image for each curve. A pedestrian detector
(SquaresChnFtrs [5]) generates much better proposals
than a state of the art generic method (EdgeBoxes [49]).

EdgeBoxes [49], a state of the art class-agnostic proposal
method. Using class-specific proposals allows us to reduce the
number of proposals by three orders of magnitude.

Thresholds for positive and negative samples Given
both training proposals and ground truth (GT) annotations,
we now consider which training label to assign to each proposal.
A proposal is considered to be a positive example if
it exceeds a certain Intersection-over-Union (IoU) threshold
for at least one GT annotation. It is considered negative if
it does not exceed a second IoU threshold for any GT annotation,
and is ignored otherwise. We find that using GT
annotations as positives is beneficial (i.e. not applying significant
jitter).
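
The labelling rule above can be sketched in a few lines. The box format `(x1, y1, x2, y2)`, the function names and the threshold values in the usage example are assumptions made for illustration; only the rule itself (positive above one IoU threshold, negative when not exceeding a second one, ignored otherwise) comes from the text.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes, pos_thr, neg_thr):
    """Return 1 (positive), 0 (negative) or None (ignored) for one proposal."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:    # exceeds the positive threshold for at least one GT box
        return 1
    if best <= neg_thr:   # does not exceed the negative threshold for any GT box
        return 0
    return None           # in between: left out of training

gt = [(0, 0, 10, 20)]
print(label_proposal((0, 0, 10, 20), gt, pos_thr=0.75, neg_thr=0.5))          # 1
print(label_proposal((100, 100, 110, 120), gt, pos_thr=0.75, neg_thr=0.5))    # 0
print(label_proposal((0, 0, 10, 12), gt, pos_thr=0.75, neg_thr=0.5))          # None
```

The third proposal has an IoU of 0.6 with the annotation, so it falls between the two thresholds and is ignored.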

Model window size A typical choice for pedestrian detectors
is a model window of 128 x 64 pixels in which the
pedestrian occupies an area of 96 x 48 [9, 11, 4, 5]. It is
unclear whether this is the ideal input size for convnets. Despite
CifarNet being designed to operate over 32 x 32 pixels,
table 2 shows that a model size of 128 x 64 pixels indeed
works best. We experimented with other variants (stretching
versus cropping, larger context border) with no clear
improvement.

Training batch In a detection setup, training samples are
typically highly imbalanced towards the background class.
Although in our validation setup the imbalance is limited
(see table 3), we found it beneficial throughout our experiments
to enforce a strict ratio of positive to negative examples
per batch of the stochastic gradient descent optimisation.
The final performance is not sensitive to this parameter
as long as some ratio (vs. None) is maintained. We
use a ratio of 1 : 5.

In the supplementary material we detail all other training 
parameters. 
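
A strict per-batch ratio can be enforced by drawing positives and negatives from separate pools. The sketch below is one possible way to do it, not the training code used in the experiments; the batch size of 60 and sampling with replacement are assumptions.

```python
import random

def sample_batch(positives, negatives, batch_size=60, ratio=(1, 5), rng=random):
    """Draw one training batch with a fixed positives:negatives ratio.

    Samples are drawn with replacement (an assumption), so a small pool of
    positives can still fill its share of every batch.
    """
    n_pos = batch_size * ratio[0] // (ratio[0] + ratio[1])
    n_neg = batch_size - n_pos
    batch = [(x, 1) for x in rng.choices(positives, k=n_pos)]
    batch += [(x, 0) for x in rng.choices(negatives, k=n_neg)]
    rng.shuffle(batch)  # avoid presenting all positives first
    return batch

rng = random.Random(0)
batch = sample_batch(["p1", "p2"], ["n%d" % i for i in range(100)], rng=rng)
n_positive = sum(label for _, label in batch)
print(n_positive, len(batch) - n_positive)  # 10 positives and 50 negatives
```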

4.2. How far can we get with the CifarNet? 

Given the parameter selection on the validation set from
previous sections, how does CifarNet compare to previously
reported convnet results on the Caltech test set? In table 4
and figure 1, we see that our naive network right away improves
over the best known convnet (30.7% MR versus SDN
37.9% MR).

To decouple the contribution of our strong
SquaresChnFtrs proposals to the CifarNet performance,
we also train a CifarNet using the proposals from
JointDeep [31]. When using the same detection proposals
at training and test time, the vanilla CifarNet already
improves over both custom-designed JointDeep and SDN.

Our CifarNet results are surprisingly close to the best
known pedestrian detector trained on Caltech1x (30.7% MR
versus SpatialPooling 29.2% MR [34]).

4.3. Exploring different architectures 

Encouraged by our initial results, we proceed to explore 
different parameters for the CifarNet architecture. 

4.3.1 Number and size of convolutional filters 

Using the Caltech validation set we perform a sweep of convolutional
filter sizes (3x3, 5x5, or 7x7 pixels) and number
of filters at each layer (16, 32, or 64 filters). We include the













Method              Proposal             Test MR
Proposals of [31]   -                    45.5%
JointDeep           Proposals of [31]    39.3% [31]
SDN                 Proposals of [31]    37.9% [27]
CifarNet            Proposals of [31]    36.5%
SquaresChnFtrs      -                    34.8% [5]
CifarNet            SquaresChnFtrs       30.7%

Table 4: Detection quality as a function of the method and
the proposals used for training and testing (MR: log-average
miss-rate on Caltech test set). When using the exact same
training data as JointDeep, our vanilla CifarNet already
improves over the previous best known convnet on Caltech
(SDN).


# layers   Architecture                            MR
3          CONV1 CONV2 CONV3 (CifarNet, fig. 2)    37.1%
3          CONV1 CONV2 LC                          43.2%
3          CONV1 CONV2 FC                          47.6%
4          CONV1 CONV2 CONV3 FC                    39.6%
4          CONV1 CONV2 CONV3 LC                    40.5%
4          CONV1 CONV2 FC1 FC2                     43.2%
4          CONV1 CONV2 CONV3 CONV4                 43.3%
DAG        CONV1 CONV2 CONV3 CONCAT23 FC           38.4%

Table 5: Detection quality of different network architectures
(MR: log-average miss-rate on Caltech validation set), sorted
by number of layers before soft-max. DAG: directed
acyclic graph.

full table in the supplementary material. We observe that
using large filter sizes hurts quality, while varying the
number of filters shows less impact. Although some fluctuation
in miss-rate is observed, overall there is no clear trend
indicating that a configuration is clearly better than another.
Thus, for the sake of simplicity, we keep using CifarNet
(32-32-64 filters of 5 x 5 pixels) in the subsequent experiments.
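
As a rough sanity check on the model size, the parameters of the three convolutional layers can be counted directly. The sketch assumes RGB input, that each layer takes the previous layer's filters as input channels, and one bias per filter; the final classification layer is left out because its size depends on details not stated here.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per filter for a square convolution kernel."""
    return kernel * kernel * in_channels * out_channels + out_channels

# (input channels, number of filters) for the 32-32-64 configuration on RGB input
layers = [(3, 32), (32, 32), (32, 64)]
per_layer = [conv_params(i, o, 5) for i, o in layers]
total = sum(per_layer)

print(per_layer)  # [2432, 25632, 51264]
print(total)      # 79328
```

Under these assumptions the convolutional layers alone account for roughly 0.8 · 10^5 parameters, the same order of magnitude as the ~10^5 quoted for CifarNet.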

4.3.2 Number and type of layers 

In table 5 we evaluate the effect of changing the number and
type of layers, while keeping other CifarNet parameters fixed.
Besides convolutional layers (CONV) and fully-connected
layers (FC), we also consider locally-connected layers (LC)
[1], and concatenating features across layers (CONCAT23)
(used in ConvNet [37]). None of the considered architecture
changes improves over the original three convolutional
layers of CifarNet.

4.4. Input channels 

As discussed in section 1.1, the majority of previous convnets
for pedestrian detection use gradient and colour features
as input, instead of raw RGB. In table 6 we evaluate
the effect of different input features over CifarNet. It seems
that the HOG+L channels provide a small advantage over RGB.

Input channels   # channels   CifarNet
RGB              3            39.9%
LUV              3            46.5%
G+LUV            4            40.0%
HOG+L            7            36.8%
HOG+LUV          10           40.7%

Table 6: Detection quality when changing the network input
channels. Results in MR: log-average
miss-rate on Caltech validation set. G indicates luminance
channel gradient, HOG indicates G plus G spread over six
orientation bins (hard-binning). These are the same input
channels used by our SquaresChnFtrs proposal method.

For purposes of direct comparison with the large networks,
in the next sections we keep using raw RGB as input
for our CifarNet experiments. We report the CifarNet test
set results in section 6.

5. Large convolutional network 

One appealing characteristic of convnets is their ability
to scale with the volume of training data. In this section we
explore larger networks trained with more data.

We base our experiments on the R-CNN [19] approach,
which is currently one of the best performers on the Pascal
VOC detection task [14]. We are thus curious to evaluate its
performance for pedestrian detection.

5.1. Surrogate tasks for improved detections 

The R-CNN approach (“Regions with CNN features”)
wraps the large network previously trained for the ImageNet
classification task [25], which we refer to as AlexNet (see
figure 4). We also use “AlexNet” as shorthand for “R-CNN
with AlexNet” with the distinction made clear by the context.
During R-CNN training AlexNet is fine-tuned for the
(pedestrian) detection task, and in a second step, the softmax
output is replaced by a linear SVM. Unless otherwise
specified, we use the default parameters of the open source,
Caffe-based R-CNN implementation³. As in the previous
sections, we use SquaresChnFtrs for detection proposals.

Pre-training If we only train the top layer SVM,
without fine-tuning the lower layers of AlexNet, we obtain
39.8% MR on the Caltech test set. This is already
surprisingly close to the best known convnet for the task
(SDN 37.9% MR). When fine-tuning all layers on Caltech,
the test set performance increases dramatically, reaching
25.9% MR. This confirms the effectiveness of the general
R-CNN recipe for detection (train AlexNet on ImageNet,
fine-tune for the task of interest).

³https://github.com/rbgirshick/rcnn

Figure 4: Illustration of the AlexNet architecture, ~6 · 10^7 parameters.

In table 7 we investigate the influence of the pre-training 
task by considering AlexNets that have been trained for 
scene recognition [48] (“Places”, see section 2) and on both 
Places and ImageNet (“Hybrid”). “Places” provides results 
close to ImageNet, suggesting that the exact pre-training 
task is not critical and that there is nothing special about 
ImageNet. 

Caltech10x Due to the large number of parameters of
AlexNet, we consider providing additional training data using
Caltech10x for fine-tuning the network (see section 2).
Despite the strong correlation across training samples, we
do observe further improvement (see table 7). Interestingly,
the bulk of the improvement is due to more pedestrians
(Positives10x, uses positives from Caltech10x and negatives
from Caltech1x). Our top result, 23.3% MR, makes
our AlexNet setup the best reported single-frame detector
on Caltech (i.e. no optical flow).

5.2. Caltech-only training 

To compare with CifarNet, and to verify whether pre-training
is necessary at all, we train AlexNet “from scratch”
using solely the Caltech training data. We collect results in
table 7.

Training AlexNet solely on Caltech yields 32.4% MR,
which improves over the proposals (SquaresChnFtrs
34.8% MR) and the previous best known convnet on Caltech
(SDN 37.9% MR). Using Caltech10x further improves
the performance, down to 27.5% MR.

Although these numbers are inferior to the ones obtained
with ImageNet pre-training (23.3% MR, see table 7),
we can get surprisingly competitive results using only pedestrian
data despite the 10^7 free parameters of the AlexNet
model. AlexNet with Caltech10x is the second best known
single-frame pedestrian detector on Caltech (best known is
LDCF 24.8% MR, which also uses Caltech10x).


AlexNet training     Fine-tuning    SVM training   Test MR
Random               none           Caltech1x      86.7%
ImageNet             none           Caltech1x      39.8%
P+ImageNet           Caltech1x      Caltech1x      30.1%
P: Places            Caltech1x      Caltech1x      27.0%
ImageNet             Caltech1x      Caltech1x      25.9%
ImageNet             Positives10x   Positives10x   23.8%
ImageNet             Caltech10x     Caltech10x     23.3%
Caltech1x            -              Caltech1x      32.4%
Caltech1x            -              Caltech10x     32.2%
Caltech10x           -              Caltech1x      [missing]
Caltech10x           -              Caltech10x     27.5%
SquaresChnFtrs [5]                                 34.8%

Table 7: Detection quality when using different training
data in different training stages of the AlexNet: initial training
of the convnet, optional fine-tuning of the convnet,
and the SVM training. Positives10x: positives from Caltech10x
and negatives from Caltech1x. Detection proposals
provided by SquaresChnFtrs, result included for comparison.
See section 5.1 and 5.2 for details.


5.3. Additional experiments 

How many layers? So far all experiments use the default
parameters of R-CNN. Previous works have reported that,
depending on the task, using features from lower AlexNet
layers can provide better results [2, 35, 3]. Table 8 reports
Caltech validation results when training the SVM output
layer on top of layers four to seven (see figure 4). We report
results when using the default parameters and parameters
that have been optimised by grid search (detailed grid
search included in supplementary material).

We observe a negligible difference between default and optimised
parameters (at most 1 percent point). Results for
default parameters exhibit a slight trend of better performance
for higher levels. These validation set results indicate
that, for pedestrian detection, the R-CNN default parameters
are a good choice overall.





























































Parameters   fc7      fc6      pool5    conv4
Default      32.2%    32.5%    33.4%    42.7%
Best         32.0%    31.8%    32.5%    42.4%

Table 8: Detection quality when training the R-CNN SVM
over different layers of the finetuned CNN. Results in MR:
log-average miss-rate on Caltech validation set. “Best parameters”
are found by exhaustive search on the validation
set.

Effect of proposal method When comparing the performance
of AlexNet fine-tuned on Caltech1x to the proposal
method, we see an improvement of 9 pp (percent
points) in miss-rate. In table 9 we study the impact of
using weaker or stronger proposals. Both ACF [10] and
SquaresChnFtrs [4, 5] provide source code, allowing
us to generate training proposals. Katamari [5]
and SpatialPooling+ [34] are current top performers
on the Caltech dataset, both using optical flow, i.e. additional
information at test time. There is a ~10 pp
gap between the detectors ACF, SquaresChnFtrs, and
Katamari/SpatialPooling+, allowing us to cover different
operating points.

The results of table 9 indicate that, despite the 10 pp gap,
there is no noticeable difference between AlexNet models
trained with ACF or SquaresChnFtrs. It seems that
as long as the proposals are not random (see top row of table
1), the obtained quality is rather stable. The results also indicate
that the quality improvement from AlexNet saturates
around ~ 22% MR. Using stronger proposals does not lead
to further improvement. This means that the discriminative
power of our trained AlexNet is on par with the best known
models on the Caltech dataset, but does not overtake them.

KITTI test set In figure 5 we show performance of
the AlexNet in context of the KITTI pedestrian detection
benchmark [17]. The network is pre-trained
on ImageNet and fine-tuned using KITTI training data.
SquaresChnFtrs reaches 44.4% AP (average precision),
which the AlexNet can improve to 46.9% AP. These
are the first published results for convnets on the KITTI pedestrian
detection dataset.

5.4. Error analysis 

Results from the previous section are encouraging, but
not as good as could be expected from looking at improvements
on Pascal VOC. So what bounds performance? The
proposal method? The localization quality of the convnet?

Looking at the highest scoring false positives paints a
picture of localization errors of the proposal method, the
R-CNN, and even the ground truth. To quantify this effect
we rerun the Caltech evaluation but remove all false positives
that touch an annotation. This experiment provides an

Fine-tuning   Training proposals   Testing proposals   Test MR   Δ vs. proposals
1x            ACF                  ACF                 34.5%     9.7%
1x            SCF                  ACF                 34.3%     9.9%
1x            ACF                  SCF                 26.9%     7.9%
1x            SCF                  SCF                 25.9%     8.9%
1x            ACF                  Katamari            25.1%     -2.6%
1x            SCF                  Katamari            24.2%     -1.7%
10x           SCF                  LDCF                23.4%     1.4%
10x           SCF                  SCF                 23.3%     11.5%
10x           SCF                  SP+                 22.0%     -0.1%
10x           SCF                  Katamari            21.6%     0.9%

Proposal method alone          Test MR
ACF [10]                       44.2%
SCF: SquaresChnFtrs [5]        34.8%
LDCF [28]                      24.8%
Katamari [5]                   22.5%
SP+: SpatialPooling+ [33]      21.9%

Table 9: Effect of proposal methods on detection quality of
R-CNN. 1x/10x indicates fine-tuning on Caltech or Caltech10x.
Test MR: log-average miss rate on Caltech test
set. Δ: the improvement in MR of the rescored proposals
over the test proposals alone.
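
The Δ column can be recomputed from the other entries of table 9: it is the miss-rate of the test proposals alone minus the miss-rate after rescoring with AlexNet. A small sketch (the dictionary and function names are illustrative only; the numbers are copied from the table):

```python
# Test MR (in %) of each proposal method on its own, from the lower part of table 9.
proposal_mr = {"ACF": 44.2, "SCF": 34.8, "LDCF": 24.8, "Katamari": 22.5, "SP+": 21.9}

def delta_vs_proposals(test_proposals, rescored_mr):
    """Improvement in MR (percent points) of the rescored proposals; positive is better."""
    return round(proposal_mr[test_proposals] - rescored_mr, 1)

print(delta_vs_proposals("ACF", 34.5))       # 9.7
print(delta_vs_proposals("SCF", 25.9))       # 8.9
print(delta_vs_proposals("Katamari", 25.1))  # -2.6
print(delta_vs_proposals("SCF", 23.3))       # 11.5
```

Negative values mark the cases where rescoring with AlexNet is worse than the test proposals alone, which is what happens with the strongest (optical flow based) proposals when fine-tuning on Caltech1x.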


Architecture          # parameters   Test MR (Caltech1x training)   Test MR (Caltech10x training)
CifarNet              ~10^5          30.7%                          28.4%
MediumNet             ~10^6          -                              27.9%
AlexNet               ~10^7          32.4%                          27.5%
SquaresChnFtrs [5]                   34.8%

Table 10: Selection of results (presented in previous sections)
when training different networks using Caltech training
data only. MR: log-average miss-rate on Caltech test
set. See section 6.


upper bound on performance when solving localisation issues
in detectors and doing perfect non-maximum suppression.
We see a surprisingly consistent improvement for all
methods of about 2% MR. This means that the intuition we
gathered from looking at false positives is wrong, and
almost all of the mistakes that worsen the MR are
actually background windows that are mistaken for pedestrians.
What is striking about this result is that this is not just
the case for our R-CNN experiments on detection proposals
but also for methods that are trained as a sliding window
detector.
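
The filtering step of this experiment can be sketched as below. This is not the evaluation code that was used; in particular, interpreting "touch" as any overlap with non-zero area, the box format `(x1, y1, x2, y2)` and the function names are assumptions.

```python
def touches(a, b):
    """True if boxes (x1, y1, x2, y2) overlap with non-zero area (assumed meaning of 'touch')."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])

def drop_touching_false_positives(detections, is_true_positive, gt_boxes):
    """Keep true positives and those false positives that lie entirely on background."""
    kept = []
    for det, tp in zip(detections, is_true_positive):
        if tp or not any(touches(det, gt) for gt in gt_boxes):
            kept.append(det)
    return kept

gt_boxes = [(0, 0, 10, 20)]
detections = [(0, 0, 10, 20), (5, 5, 30, 40), (100, 100, 110, 120)]
is_true_positive = [True, False, False]  # as decided by the standard matching step

# The badly localised detection (second box) is removed; the background one stays.
print(drop_touching_false_positives(detections, is_true_positive, gt_boxes))
```

If the miss-rate barely changes after this filtering, the remaining false positives must be background windows, which is the conclusion drawn above.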

6. Small or big convnet? 

Since we have analysed the CifarNet and AlexNet separately,
we compare their performance in this section side by




















Figure 5: AlexNet on the KITTI test set (KITTI Pedestrians, moderate difficulty; horizontal axis: recall).

side. Table 10 shows performance on the Caltech test set for
models that have been trained only on Caltech1x and Caltech10x.
With less training data the CifarNet reaches 30.7%
MR, performing 2 percent points better than the AlexNet.
On Caltech10x, we find the CifarNet performance improved
to 28.4%, while the AlexNet improves to 27.5% MR. The
trend confirms the intuition that models with lower capacity
saturate earlier when increasing the amount of training data
than models with higher capacity. We can also conclude that
the AlexNet would profit from better regularisation when
training on Caltech1x.

Timing The runtime during detection is about 3ms per
proposal window. This is too slow for sliding window detection,
but given a fast proposal method that has high recall
with fewer than 100 windows per image, scoring takes about
300ms per image. In our experience SquaresChnFtrs
runs in 2s per image, so proposing detections takes most of
the detection time.
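
The timing budget above amounts to the following arithmetic; the constants are the rounded figures quoted in the text and the variable names are illustrative.

```python
MS_PER_WINDOW = 3        # convnet scoring time per proposal window
WINDOWS_PER_IMAGE = 100  # upper end of the proposal budget per image
PROPOSAL_MS = 2000       # SquaresChnFtrs runtime per image

scoring_ms = MS_PER_WINDOW * WINDOWS_PER_IMAGE  # 300 ms
total_ms = PROPOSAL_MS + scoring_ms             # 2300 ms
proposal_share = PROPOSAL_MS / total_ms         # about 0.87

print(scoring_ms, total_ms, round(proposal_share, 2))
```

With these figures the proposal stage accounts for close to 90% of the per-image runtime, so speeding up the proposals matters more than speeding up the convnet.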

7. Takeaways 

Previous work suggests that convnets for pedestrian detection
underperform, despite having involved architectures
(see [5] for a survey of pedestrian detection). In this paper
we showed that neither has to be the case. We present
a wide range of experiments with two off-the-shelf models
that reach competitive performance: the small CifarNet and
the big AlexNet.

We present two networks that are trained on Caltech 
only, which outperform all previously published convnets 
on Caltech. The CifarNet shows better performance than 
related work, even when using the same training data as 
the respective methods (section 4.2). Despite its size, the 
AlexNet also improves over all convnets even when it is 
trained on Caltech only (section 5.2). 

We push the state of the art for pedestrian detectors that
have been trained on Caltech1x and Caltech10x. The CifarNet
is the best single-frame pedestrian detector that has
been trained on Caltech1x (section 4.2), while AlexNet is
the best single-frame pedestrian detector trained on
Caltech10x (section 5.2).

Figure 6: Comparison of our key results (thick lines) with
published methods on the Caltech test set. Methods using
optical flow are dashed. (Horizontal axis: false positives per
image; surviving legend entry: 21.9% SpatialPooling+.)

In figure 6, we include all published methods on
Caltech in the comparison, which also adds methods that
use additional information at test time. The AlexNet that
has been pre-trained on ImageNet reaches results competitive
with the best published methods, but without using
additional information at test time (section 5.1).

We report first results for convnets on the KITTI ped¬ 
estrian detection benchmark. The AlexNet improves over 
the proposal method alone, delivering encouraging results 
to further push KITTI performance with convnets. 

8. Conclusion 

We have presented extensive and systematic experi¬ 
mental evidence on the effectiveness of convnets for pedes¬ 
trian detection. Compared to previous convnets applied to 
pedestrian detection our approach avoids custom designs. 
When using the exact same proposals and training data 
as previous approaches our “vanilla” networks outperform 
previous results. 

We have shown that with pre-training on surrogate tasks,
convnets can reach top performance on this task.
Interestingly, we have shown that even without pre-training
competitive results can be achieved, and this result is quite
insensitive to the model size (from 10^5 to 10^7 parameters).
Our experiments also detail which parameters are most critical
to achieve top performance. We report the best known results
for convnets on both the challenging Caltech and KITTI
datasets.

Our experience with convnets indicates that they show
good promise on pedestrian detection, and that reported best
practices do transfer to said task. That being said, on this
more mature field we do not observe the large improvement
seen on datasets such as Pascal VOC and ImageNet.

A. CifarNet training, the devil is in the details

Training neural networks is sensitive to a large number 
of parameters, including the learning rate, how the network 
weights are initialised, the type of regularisation applied to 
the weights, and the training batch size. It is difficult to 
isolate the effects of the individual parameters, and the best 
parameters will largely depend on the specific setup. We 
report here the parameters we used. 

We train CifarNet via stochastic gradient descent (SGD) 
with a learning rate of 0.005, a momentum of 0.9, and a 
batch size of 128. After 60 epochs, we reduce the learning 
rate by a factor of 0.1 and train for an additional 10 epochs. 
Reducing the learning rate even further did not improve the 
classification accuracy. The other learning rate policies we 
explored yielded inferior performance (e.g. gradually redu¬ 
cing the learning rate each training iteration). Careful tun¬ 
ing of the learning rate while adjusting the batch size was 
critical. 
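
The learning rate policy described above can be sketched as a simple step schedule. This is a minimal illustration: the function name and the zero-based epoch indexing are assumptions, and the actual training was configured in Caffe rather than in Python.

```python
def cifarnet_learning_rate(epoch, base_lr=0.005, drop_epoch=60, factor=0.1):
    """Step schedule: base_lr for the first `drop_epoch` epochs,
    then base_lr * factor for the remaining ones (epochs count from 0)."""
    return base_lr if epoch < drop_epoch else base_lr * factor


TOTAL_EPOCHS = 70  # 60 epochs at the base rate plus 10 at the reduced rate
MOMENTUM = 0.9     # SGD momentum quoted in the text
BATCH_SIZE = 128   # batch size quoted in the text

schedule = [cifarnet_learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]
```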

Other than the softmax classification loss, the training
loss includes an L2 regularisation of the network weights. In
the objective function, this regularisation term has a weight
of 0.005 for all layers but the last one (softmax weights),
which receives weight 1. This parameter is referred to in
Caffe as “weight decay”.

The network weights are initialised by drawing values
from a Gaussian distribution with standard deviation σ =
0.01, with the exception of the first layer, for which we set
σ = 0.0001.
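
The regularisation and initialisation settings above can be illustrated with the following minimal sketch in plain Python. Several details are assumptions made for illustration only: the Gaussian is taken to be zero-mean, each layer is represented as a flat list of weights, the layer sizes are hypothetical, and the penalty is written as a plain weighted sum of squares (a framework may apply an extra constant factor such as 1/2).

```python
import random


def init_weights(layer_sizes, std=0.01, first_layer_std=0.0001, seed=0):
    """Draw Gaussian weights for each layer (assumed zero-mean).

    The first layer uses the smaller standard deviation quoted in the text.
    """
    rng = random.Random(seed)
    weights = []
    for index, size in enumerate(layer_sizes):
        sigma = first_layer_std if index == 0 else std
        weights.append([rng.gauss(0.0, sigma) for _ in range(size)])
    return weights


def l2_penalty(weights, decay=0.005, last_layer_decay=1.0):
    """Weighted sum of squared weights: `decay` for every layer except
    the last one, which uses `last_layer_decay`."""
    total = 0.0
    last = len(weights) - 1
    for index, layer in enumerate(weights):
        coeff = last_layer_decay if index == last else decay
        total += coeff * sum(w * w for w in layer)
    return total


# Hypothetical layer sizes, chosen only to keep the example small.
weights = init_weights([200, 300, 100])
penalty = l2_penalty(weights)
```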

B. Grid search around CifarNet 

Table 11 shows the detection quality of different vari¬ 
ants of CifarNet obtained by changing the number and size 
of the convolutional filters of each layer. See related section 
4.3.1 of the main paper. Since different training rounds have 
different random initial weights, we train four networks for 
each parameter set and average the results. We report both 
mean and standard deviation of the miss rate on our valida¬ 
tion set. 
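
The averaging over repeated training rounds can be sketched as follows. The four miss-rate values are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarise_runs(miss_rates):
    """Return (mean, standard deviation) of the miss rates of repeated runs.

    Uses the sample standard deviation (n - 1 in the denominator); this
    choice is an assumption.
    """
    return statistics.mean(miss_rates), statistics.stdev(miss_rates)


# Hypothetical miss rates (%) of four networks trained with the same
# parameter set but different random initial weights.
runs = [40.1, 41.9, 42.3, 40.1]
mean_mr, std_mr = summarise_runs(runs)
print(f"{mean_mr:.1f} ± {std_mr:.1f}")
```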

We observe that using either too small or too large filter
sizes throughout the network hurts quality. The network
width also seems to matter: a network that is too narrow or
too wide can negatively impact classification accuracy. All in
all, the “middle section” of the table shows only small
fluctuations in miss rate (especially when considering the
variance).

In addition to filter size and layer width, we also
experimented with different types of pooling layers (max-pooling
versus mean-pooling); see figure 2 of the main paper. Other
than on the first layer, replacing mean-pooling with
max-pooling hurts performance.

The results of table 2 indicate that there is no set of
parameters close to CifarNet with a clear advantage over the
default CifarNet parameters. When going too far from
CifarNet parameters, classification accuracy plunges.

C. Grid search for AlexNet 

Table 12 presents the sweep of parameters used to
construct the “Best parameters” entries in table 8 of the main
paper. We vary the criterion to select negative samples and
the SVM regularization parameter. Default parameters
are IoU < 0.5 and C = 10^-3.

Overall we notice that the results are not very sensitive to
either parameter (fluctuations of 1-2 percentage points).
When C is far from optimal, a large degradation is observed
(10 percentage points). As seen in table 8 of the main paper,
the gap between default and tuned parameters is rather small
(1-2 percentage points).

D. Datasets statistics 

In figure 7 we plot the height distribution for pedestrians
in the Caltech and KITTI training sets. Although the datasets
are visually similar, the height distributions are somewhat
dissimilar (for reference, the ImageNet and Pascal
distributions are more alike).

It was shown in [5] that models trained on each dataset
do not transfer well to the other (compared to models
trained on the smaller INRIA dataset).

E. Proposals statistics 

In figures 8 and 9 we show statistics of different detectors 
on the Caltech test set, including the ones we use as propos¬ 
als in our experiments. These figures complement table 9 of 
the main paper. 

Our initial experiments indicated that it is important to
keep a low number of average proposals per image in order
to reduce the false positive rate (post re-scoring). This
is in contrast to common practice when using class-agnostic
proposal methods, where using more windows is considered
better because they provide higher recall [21]. We filter
proposals via a threshold on the detection score.

As can be seen in figure 8 a recall higher than 90% can be 
achieved with only ~3 proposals per image on average (for 
Intersection-over-Union threshold above 0.5, the evaluation 
criterion). The average number of proposals per image is 
quite low because most frames of the Caltech test set do not 
contain any pedestrian. 
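
The quantities discussed above (Intersection-over-Union, score-based filtering of proposals, and recall at a given IoU threshold) can be sketched as follows. This is a simplified illustration: the box format (x1, y1, x2, y2), the function names, and the example boxes are assumptions, and the official Caltech evaluation additionally handles one-to-one matching and ignore regions, which this sketch omits.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_by_score(proposals, min_score):
    """Keep the (box, score) proposals whose score reaches the threshold."""
    return [(box, score) for box, score in proposals if score >= min_score]


def recall(ground_truth, proposals, iou_threshold=0.5):
    """Fraction of ground-truth boxes overlapped by at least one proposal
    with IoU above the threshold (no one-to-one matching)."""
    if not ground_truth:
        return 0.0
    hits = sum(
        any(iou(gt, box) > iou_threshold for box, _ in proposals)
        for gt in ground_truth
    )
    return hits / len(ground_truth)


# Hypothetical example: one pedestrian, one good and one poor proposal.
ground_truth = [(0, 0, 10, 20)]
proposals = [((1, 0, 11, 20), 0.9), ((50, 50, 60, 70), 0.2)]
kept = filter_by_score(proposals, min_score=0.5)
print(len(kept), recall(ground_truth, kept))
```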

In figure 9 we show the number of false positives at
different overlap levels with the ground truth annotations. The
bump around 0.5 IoU, most visible for SpatialPooling
and LDCF, is an artefact of the non-maximum suppression
used by each method. Both of these methods obtain
high quality detections, thus they must assign (very) low
scores to these false positive windows. To further improve
quality the re-scoring method must do the same.


Sizes \ # filters   16,16,16     32,32,64     32,64,32     64,32,32     32,32,32     64,64,64     64,32,16     Mean
3,3,3               48.4 ± 1.7   44.4 ± 1.0   43.6 ± 0.8   45.1 ± 1.1   45.2 ± 0.7   42.3 ± 1.3   46.6 ± 2.1   45.1
5,5,5               42.7 ± 4.2   41.1 ± 1.3   39.1 ± 1.0   38.9 ± 1.5   37.8 ± 1.6   38.3 ± 2.5   38.5 ± 1.3   39.5
7,5,3               43.3 ± 2.9   38.7 ± 2.4   38.6 ± 2.1   38.8 ± 0.9   40.2 ± 2.0   37.9 ± 1.7   39.7 ± 0.7   39.6
7,5,5               43.5 ± 2.5   40.2 ± 0.9   40.8 ± 2.6   38.4 ± 0.9   40.8 ± 1.5   40.0 ± 0.4   41.7 ± 2.5   40.8
7,7,5               43.5 ± 2.7   41.6 ± 3.0   43.3 ± 6.1   40.5 ± 2.9   39.8 ± 2.5   47.3 ± 2.5   41.6 ± 2.0   42.5
Mean                44.3         41.2         41.1         40.4         40.8         41.2         41.6

Table 11: Detection quality (MR%) as a function of the number of filters per layer (columns) and filter sizes per layer (rows). CifarNet
parameters are highlighted in italic. (MR: log-average miss-rate on Caltech validation set).
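
For readers unfamiliar with the metric, the log-average miss rate can be sketched as below. The sketch assumes the standard Caltech protocol (geometric mean of the miss rate at nine FPPI reference points evenly spaced in log-space between 10^-2 and 10^0); the text here does not spell out the definition, and the function name, curve format, and handling of curves that start above a reference point are illustrative assumptions.

```python
import math


def log_average_miss_rate(curve, num_points=9, lo_exp=-2.0, hi_exp=0.0):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    `curve` is a list of (fppi, miss_rate) pairs sorted by increasing fppi.
    """
    step = (hi_exp - lo_exp) / (num_points - 1)
    refs = [10 ** (lo_exp + i * step) for i in range(num_points)]
    log_sum = 0.0
    for ref in refs:
        # Miss rate at the last curve point whose FPPI does not exceed ref;
        # if the curve starts above ref, count every pedestrian as missed.
        reachable = [mr for fppi, mr in curve if fppi <= ref]
        miss_rate = reachable[-1] if reachable else 1.0
        log_sum += math.log(max(miss_rate, 1e-10))
    return math.exp(log_sum / num_points)


# Hypothetical curve, for illustration only.
example_curve = [(0.001, 0.60), (0.01, 0.50), (0.1, 0.35), (1.0, 0.20)]
print(f"log-average MR: {100 * log_average_miss_rate(example_curve):.1f}%")
```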

Table 12: Detection quality (MR) as a function of the maximal IoU threshold to consider a proposal as negative example and
the SVM regularization parameter C, for (a) layer fc7 and (b) layer fc6. (MR: log-average miss-rate on Caltech validation set)

Figure 7: Histogram of pedestrian heights in different
datasets: (a) Caltech Reasonable training set, (b) KITTI
training set.

When using a method for proposals one desires to have
high recall with high overlap with the ground truth (figure
8), as well as having false positives with low overlap with
the ground truth (figure 9). False positive proposals
overlapping true pedestrians will contain pieces of persons,
which might confuse the re-scoring classifier. Classifying
fully centred persons versus random background is assumed to
be an easier task.

In table 9 of the main paper we see that AlexNet 
reaches top detection quality by improving over LDCF, 
SquaresChnFtrs, and Katamari. 



Figure 8: Recall of ground truth versus IoU threshold, for a
selection of detection methods. The curves are cumulative
distributions. The detections have been filtered by score to
reach ~3 proposals per image on average (number indicated
in the legend).

Figure 9: Distribution of overlap between false positives
and ground truth of those false positives that do overlap with
the ground truth. The curves are histograms with coarse IoU
bins. The number in the legend indicates the average number of
proposals per image (after filtering to reach ~3).
































